A new look on the genotype-by-environment interaction: enviromics and probabilistic models
8th International Meeting on Plant Breeding
1 Introduction
This guide will help you using ProbBreed’s functions, from fitting Bayesian models to computing probabilities. If you need more theoretical details, check the papers Dias et al. (2022) and Chaves et al. (2024). The vignette and the help page of each function (for e.g., help(prob_sup)) have more details on the functions arguments. Also, make sure you have package installed. For the CRAN (stable) version:
install.packages("ProbBreed")For installing the development version (everything we do before submitting to CRAN), use:
# install.packages("devtools")
devtools::install_github("saulo-chaves/ProbBreed")After installing, load the package:
library(ProbBreed)2 Available datasets
The package contains two dataset that can be used as examples to run the functions: soy and maize. soy belongs to the USDA Northern Region Uniform Soybean Tests, and it is a subset of the data used by Krause et al. (2023). It contains the empirical best linear unbiased estimates of genotypic means of the seed yield from 39 experimental genotypes evaluated in 14 locations.
head(soy) Loc Gen Y
1 E01 G01 49.89367
2 E01 G02 48.51867
3 E01 G03 49.81867
4 E01 G04 47.31867
5 E01 G05 47.91867
6 E01 G06 53.01867
maize belongs to value of cultivation and use maize trials of Embrapa Maize and Sorghum, and was used by Dias et al. (2022). It contains the grain yield of 32 single-cross hybrids and four commercial checks (36 genotypes in total) evaluated in 16 locations across five regions or mega-environments. These trials were laid out in incomplete blocks design, using a block size of 6 and two replications per trial
head(maize) Region Location Rep Block Hybrid GY
1 TB E10 1 3 G9 9.65
2 TB E10 1 6 G20 6.72
3 TB E10 1 5 G34 6.67
4 TB E10 1 4 G18 7.69
5 TB E10 1 6 G30 8.99
6 TB E10 1 5 G27 9.53
We will begin our tutorial with the soy dataset, and end with the maize dataset. We will explain the reason for this in a pinch.
3 First step: fitting the Bayesian model
The first step is to fit a multi-environment Bayesian model using the bayes_met function. This function has a set of predefined models that are used to run Bayesian analyses using rstan. Currently, the function has nine options of models (Figure 1).
repl), years (year) and regions (reg) effects in the bayes_met function. Users must substitute Repl, Block, Year, and Region by the name of the column that contains the information of replicates, blocks (if applicable), year (if available) and region (if available). RCDB and IBD are acronyms for randomized complete block design and incomplete block design, respectively. A model using only adjusted means can only be fitted with reg = NULL and year = NULL
These models differ according to the considered information: only locations, locations and years, locations and breeding regions, and locations, years and breeding regions. Should you consider an “environment” as a combination of environmental factors (for instance, location and planting dates within location), this information must be in the loc argument. All models will have gen and loc. If you want to consider the effect of regions (or mega-environment) and/or time factors (for instance, years or harvests), then reg and year must represent the column that contains the information on these effects. Otherwise, these arguments are set to NULL. You may control the experimental design effects in the repl argument. If repl = NULL, bayes_met will assume that you are entering with adjusted means of each environment, i.e., data without replicates. If you have data from trials laid out in randomized complete blocks, repl = 'Replicate'. Finally, if the multi-environment trials were laid out in incomplete blocks design, repl = c('Replicate', 'Block'). By default, when repl = NULL, reg and year must also be NULL.
The Bayesian models implemented in ProbBreed have predefined priors described by Dias et al. (2022). In summary: \[
x \sim \mathcal{N}\left(0,S^{[x]}\right)
\]
\[ \sigma \sim \mathcal{HalfCauchy}\left(0, S^{[\sigma]}\right) \]
where \(x\) can be any effect but the residual, and \(\sigma\) is the standard deviation of the likelihood. If an heterogeneous model is set (res.het = TRUE), then \(\sigma_k \sim HalfCauchy\left(0, S^{[\sigma_k]}\right)\).
The hyperpriors are set as follows:
\[ S^{[x]} \sim HalfCauchy(0, \phi) \]
where \(\phi\) is the known global hyperparameter defined such as \(\phi = max(y) \times 10\). By doing this, we restrain the hyperparameter to be finite and allow the model to take full advantage of the data to infer the posterior distribution since it provide weakly informative prior distributions, avoiding subjectivity that can possibly hamper the results. The half-Cauchy distribution was used to guarantee that variance components will not be negative.
Let’s check how can we fit the Bayesian models for the soy and maize datasets:
soy(we will run this)
mod_soy = bayes_met(data = soy,
gen = "Gen",
loc = "Loc",
repl = NULL,
trait = "Y",
res.het = TRUE,
iter = 2000,
cores = 4,
chains = 4)maize(homework: run this latter)
mod_maize = bayes_met(data = maize,
gen = "Hybrid",
loc = "Location",
repl = c("Rep","Block"),
trait = "GY",
reg = "Region",
year = NULL,
res.het = TRUE,
iter = 2000,
cores = 4,
chain = 4)You may change the number of iterations, chains, and cores. This will vary according to the data and the processing capacity of your machine. Be aware that the more iterations and chains, the longer the computation time. At the same time, results are likely more reliable, and mixing/convergence issues will diminish. By default, if more than one core is set, the function runs one chain per core. The number of cores depends on the processing capacity of your machine. Choose wisely.
ProbBreed uses rstan to fit Bayesian models, so packages and functions designed to manage and explore these models can be used in the output object of the bayes_met function. Some interesting options are rstantools, shinystan, and bayesplot.
Check out the shinystan package:
library(shinystan)
shinystan::launch_shinystan(mod)4 Second step: explore the model’s outputs
After fitting the model, the next step is to extract outputs using extr_outs. This function also provides other useful information, like variance components and posterior predictive checks. Using the mod_soy object from the previous step:
outs = extr_outs(model = mod_soy,
probs = c(0.05, 0.95),
verbose = TRUE)where:
modelis the model fitted usingbayes_metprobsare the probabilities considered to calculate the quantiles and the highest posterior density (HPD)
This function provides an object of class extr, which is a list with the posterior distribution of each effect, the data generated by the model, the maximum posterior values of each effect, the variances of each effect, and quality parameters of the model (see below).
outs$variances effect var sd naive.se HPD_0.05 HPD_0.95
1 Gen 3.373 1.308 0.021 1.696 5.814
2 Loc 246.870 125.446 1.983 117.417 475.870
3 error_env1 10.069 2.670 0.042 6.471 15.045
4 error_env2 29.046 7.055 0.112 19.479 42.455
5 error_env3 11.561 3.313 0.052 7.252 17.680
6 error_env4 18.536 4.907 0.078 11.982 27.364
7 error_env5 51.325 12.414 0.196 34.575 73.848
8 error_env6 15.730 4.132 0.065 10.276 23.350
9 error_env7 19.356 4.929 0.078 12.758 28.475
10 error_env8 21.170 5.503 0.087 13.824 31.311
11 error_env9 14.878 3.809 0.060 9.673 21.735
12 error_env10 13.528 5.673 0.090 6.863 24.331
13 error_env11 22.291 9.237 0.146 11.756 39.050
14 error_env12 7.450 3.236 0.051 3.610 13.345
15 error_env13 14.689 5.942 0.094 7.623 25.672
16 error_env14 10.986 4.363 0.069 5.690 19.079
outs$ppcheck Diagnostics
p.val_max 0.9170
p.val_min 0.3608
p.val_median 0.7418
p.val_mean 0.5130
p.val_sd 0.5540
Eff_No_parameters 27.3972
WAIC2 2715.5314
mean_Rhat 1.0011
Eff_sample_size 0.6399
The S3 method plot will generate some useful plots, like the density plots and histograms built from the posterior effects (Figure 2). It can also build trace plots.
plot(outs, category = "density")
plot(outs, category = "histogram")A particularly useful plot is the comparison between the empirical and sampled phenotype, which illustrates the model’s convergence (Figure 3). The more the density plots overlap, the more successful was the model in sampling the effects.
plot(outs)5 Third step: compute the probabilities
With the outputs extracted by extr_outs, we can calculate the probabilities of superior performance and superior stability of the evaluated genotypes with prob_sup:
results = prob_sup(extr = outs,
int = .2,
increase = TRUE,
save.df = FALSE,
verbose = FALSE)where:
increase: The objective is for increasing (TRUE, default) or decreasing (FALSE) the trait meanextr: Anextrobject that contains the outputs of theextr_outs()function.int: The selection intensity, expressed in decimal values. In the example, the selection intensity is 20%.save.df:TRUEif you want to save the data frames containing the calculated probabilities in the working directory,FALSEotherwise.
This function generates an object of class probsup, which consists of two lists: across and within. The first list contains the probabilities of superior performance and superior stability across environments, while the second contains the probabilities of superior performance within environments. probsup is also compatible with the S3 method plot. See the details below or in help("plot.probsup")
Within-environment probabilities of superior performance, and probabilities of superior stability are unavailable for entry-mean models (i.e., the one we used for the soy data), since we have no explicit genotype \(\times\) environment interaction effect in this case.
Let’s check the available outputs for soy:
5.1 HPD of the genotypic effects
head(results$across$g_hpd) gen g HPD95 HPD97.5 HPD5 HPD7.5
1 G01 1.3762307 2.9176431 3.2627871 -0.1692294 -0.4855525
2 G02 -1.4201714 0.4352518 0.8312823 -3.3015241 -3.6673793
3 G03 0.3276901 2.1634450 2.4905701 -1.5210447 -1.8249005
4 G04 -1.0515739 0.7217608 1.1063974 -2.9132463 -3.2910979
5 G05 -2.0253358 -0.1852200 0.1379327 -4.0239070 -4.3471062
6 G06 0.3947565 2.1778044 2.5478518 -1.4134656 -1.7735013
plot(results, category = "hpd")5.2 Probability of superior performance
\[ Pr\left(\hat{g}_j \in \Omega \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_j^{(s)} \in \Omega \vert y\right) \]
head(results$across$perfo) ID prob
36 G36 0.98175
9 G09 0.84125
20 G20 0.82800
38 G38 0.72025
31 G31 0.66900
1 G01 0.51575
plot(results)5.3 Pairwise probability of superior performance
\[ Pr\left(\hat{g}_{j} > \hat{g}_{j^\prime} \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{j}^{(s)} > \hat{g}_{j^\prime}^{(s)} \vert y\right) \]
This equation is adequate for when we want to increase the trait value. If we set the argument increase = FALSE, then \(>\) is changed to \(<\) in the equation above.
results$across$pair_perfo[1:5, 1:5] G01 G02 G03 G04 G05
G01 0.00000 0.02475 0.23725 0.0445 0.00725
G02 0.97525 0.00000 0.86450 0.5985 0.34250
G03 0.76275 0.13550 0.00000 0.1815 0.06350
G04 0.95550 0.40150 0.81850 0.0000 0.25700
G05 0.99275 0.65750 0.93650 0.7430 0.00000
plot(results, category = "pair_perfo")This is as far as we can go with the soy data. Now, let’s shift to the maize data. To speed things up, let’s load the object containing the probabilities:
probs_maize = readRDS(file = "probs_maize.rds")5.4 Probability of superior stability
\[ Pr\left[var\left(\widehat{ge}_{jk}\right) \in \Omega \vert y\right] = \frac{1}{S} \sum_{s=1}^S I\left[var\left(\widehat{ge}_{jk}^{(s)}\right) \in \Omega \vert y\right] \]
Note that this probability can only be computed across environments since it depends on \(var\left(\widehat{ge}_{jk}\right)\).
head(probs_maize$across$stabi$gl) ID prob
8 G16 0.52225
25 G31 0.46600
28 G34 0.40550
4 G12 0.39975
24 G30 0.37625
29 G35 0.37100
head(probs_maize$across$stabi$gm) ID prob
36 G9 0.27375
32 G5 0.26775
28 G34 0.26675
18 G25 0.26075
24 G30 0.26050
26 G32 0.25900
plot(probs_maize, category = "stabi")5.5 Pairwise probability of superior stability
\[ Pr\left[var\left(\widehat{ge_{jk}}\right) < var\left(\widehat{ge}_{ik}\right) \vert y\right] = \frac{1}{S} \sum_{s=1}^S{I \left[var\left(\widehat{ge}_{jk}^{(s)}\right) < var\left(\widehat{ge}_{ik}^{(s)}\right) \vert y\right]} \]
Note that, in this case, we are interested in the genotype that has a lower variance of the genotype-by-environment (or genotype-by-region) interaction effects.
probs_maize$across$pair_stabi$gl[1:5, 1:5] G1 G10 G11 G12 G13
G1 0.00000 0.74525 0.75875 0.88250 0.7850
G10 0.25475 0.00000 0.52025 0.70625 0.5495
G11 0.24125 0.47975 0.00000 0.69650 0.5410
G12 0.11750 0.29375 0.30350 0.00000 0.3470
G13 0.21500 0.45050 0.45900 0.65300 0.0000
probs_maize$across$pair_stabi$gm[1:5, 1:5] G1 G10 G11 G12 G13
G1 0.00000 0.56225 0.58550 0.56725 0.57525
G10 0.43775 0.00000 0.52250 0.51900 0.52425
G11 0.41450 0.47750 0.00000 0.49650 0.49525
G12 0.43275 0.48100 0.50350 0.00000 0.49850
G13 0.42475 0.47575 0.50475 0.50150 0.00000
plot(probs_maize, category = "pair_stabi")5.6 Joint probability of superior performance and superior stability
\[ Pr\left[\hat{g}_j \in \Omega \cap var\left(\widehat{ge}_{jk}\right) \in \Omega\right] = Pr\left(\hat{g}_j \in \Omega\right) \times Pr\left[var\left(\widehat{ge}_{jk}\right) \in \Omega\right] \]
head(probs_maize$across$joint) ID level category prob
145 G1 Location Joint 0.026392500
146 G10 Location Joint 0.000000000
147 G11 Location Joint 0.006300000
148 G12 Location Joint 0.151805062
149 G13 Location Joint 0.133568750
150 G14 Location Joint 0.001108125
plot(probs_maize, category = "joint")5.7 Probabilities within environments
When GEI effects are estimated (in all situation but the entry-mean model), we can calculate probabilities of superior performance within individual environments:
5.7.1 Probability of superior performance
Instead of using the marginal genotypic effect of each genotype, we use the within environment genotypic effect \(\left(g_{jk} = g_j + ge_{jk}\right)\):
\[ Pr\left(\hat{g}_{jk} \in \Omega_k \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{jk}^{(s)} \in \Omega_k \vert y\right) \]
head(probs_maize$within$perfo$gl) gen E1 E10 E11 E12 E13 E14 E15 E16 E2
1 G1 0.98325 0.26325 0.38075 0.44300 0.27800 0.94925 0.32450 0.97100 0.89000
2 G10 0.01125 0.00025 0.00075 0.00300 0.02125 0.07200 0.00000 0.04075 0.10100
3 G11 0.00550 0.09850 0.11325 0.00375 0.61600 0.10925 0.75000 0.09850 0.01850
4 G12 0.15950 0.08575 0.04500 0.06200 0.06050 0.30525 0.16000 0.56775 0.42625
5 G13 0.20900 0.18625 0.13600 0.13175 0.06300 0.58950 0.81750 0.06975 0.22175
6 G14 0.00000 0.00450 0.10925 0.16250 0.02975 0.06000 0.00075 0.02325 0.00225
E3 E4 E5 E6 E7 E8 E9
1 0.63450 0.91950 0.52275 0.30625 0.00350 0.85675 0.37450
2 0.20125 0.00200 0.04025 0.01875 0.00125 0.00050 0.04500
3 0.03050 0.02025 0.11150 0.33350 0.01575 0.04425 0.23275
4 0.59875 0.18200 0.37100 0.45225 0.81950 0.15950 0.30325
5 0.70250 0.69550 0.60050 0.82175 0.46250 0.11825 0.53675
6 0.04975 0.13150 0.11350 0.62225 0.02850 0.03475 0.07025
plot(probs_maize, category = "perfo", level = "within")5.7.2 Pairwise probability of superior performance
\[ Pr\left(\hat{g}_{jk} > \hat{g}_{j^\prime k} \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{jk}^{(s)} > \hat{g}_{j^\prime k}^{(s)} \vert y\right) \]
Or \(<\) instead of \(>\) if increase = TRUE
The comparison are made per environment, so it impossible to illustrate them all in a single plot. plot.probsup will ask you if you are storing the figures in an object. If not, the function won’t build the plots. Long story short: store the outcomes of plot.probsup in an object in the case of within-environment pairwise probability of superior performance.
head(probs_maize$within$perfo$gl)
pairsupwithin = plot(probs_maize, category = "pair_perfo", level = "within")
pairsupwithin$gl$E16
pairsupwithin$gm$NE gen E1 E10 E11 E12 E13 E14 E15 E16 E2
1 G1 0.98325 0.26325 0.38075 0.44300 0.27800 0.94925 0.32450 0.97100 0.89000
2 G10 0.01125 0.00025 0.00075 0.00300 0.02125 0.07200 0.00000 0.04075 0.10100
3 G11 0.00550 0.09850 0.11325 0.00375 0.61600 0.10925 0.75000 0.09850 0.01850
4 G12 0.15950 0.08575 0.04500 0.06200 0.06050 0.30525 0.16000 0.56775 0.42625
5 G13 0.20900 0.18625 0.13600 0.13175 0.06300 0.58950 0.81750 0.06975 0.22175
6 G14 0.00000 0.00450 0.10925 0.16250 0.02975 0.06000 0.00075 0.02325 0.00225
E3 E4 E5 E6 E7 E8 E9
1 0.63450 0.91950 0.52275 0.30625 0.00350 0.85675 0.37450
2 0.20125 0.00200 0.04025 0.01875 0.00125 0.00050 0.04500
3 0.03050 0.02025 0.11150 0.33350 0.01575 0.04425 0.23275
4 0.59875 0.18200 0.37100 0.45225 0.81950 0.15950 0.30325
5 0.70250 0.69550 0.60050 0.82175 0.46250 0.11825 0.53675
6 0.04975 0.13150 0.11350 0.62225 0.02850 0.03475 0.07025
Are you using an object to store the outputs of this function? Enter Y/n:
Like the other heatmaps, we are evaluating the probability of genotypes at x-axis being superior than genotypes at the y-axis.
6 Wrap up
The estimated probabilities demonstrated in this tutorial from ProbBreed are related to some key question that constantly arises in plant breeding:
What is the risk of recommending a selection candidate?
What is the probability of a given selection candidate having good performance across or within environments?
What is the probability of a selection candidate having better performance than a check cultivar check?
How probable is it that a given selection candidate performs similarly across environments?
What are the chances that a given selection candidate is more stable than a cultivar check?
What is the probability that a given selection candidate having a superior and invariable performance across environments?
If you used it for a scientific publication, please, cite it:
citation("ProbBreed")Thank you for citing ProbBreed in your publication! Please, use:
Chaves SFS, Krause MD, Dias LAS, Garcia AAF, Dias KOG (2024).
"ProbBreed: a novel tool for calculating the risk of cultivar
recommendation in multienvironment trials." _G3
Genes|Genomes|Genetics_, *14*(3), jkae013.
doi:10.1093/g3journal/jkae013
<https://doi.org/10.1093/g3journal/jkae013>.
Uma entrada BibTeX para usuários(as) de LaTeX é
@Article{ProbBreed_paper,
title = {ProbBreed: a novel tool for calculating the risk of cultivar recommendation in multienvironment trials},
author = {Saulo F. S. Chaves and Matheus D. Krause and Luiz A. S. Dias and Antonio A. F. Garcia and Kaio O. G. Dias},
journal = {G3 Genes|Genomes|Genetics},
year = {2024},
volume = {14},
number = {3},
pages = {jkae013},
doi = {10.1093/g3journal/jkae013},
}